House Prices

CUNY SPS DATA 621

GROUP 2: William Aiken, Donald Butler, Michael Ippolito, Bharani Nittala, and Leticia Salazar

Load Libraries:

These are the libraries used to explore and prepare the data and to build and analyze our models.

library(tidyverse)
library(dplyr)
library(corrplot)
library(MASS)
library(dvmisc)
library(car)
library(lmtest)
library(olsrr)
library(caret)
library(kableExtra)
library(hrbrthemes)

Load Data:

We have included the original data sets in our GitHub account and read from this location. Since the data does not come with separate training and evaluation sets, we split it ourselves using a 70%/30% split. A sample of the training set is shown below:
Table 1. Training data sample

     price  area bedrooms bathrooms stories mainroad guestroom basement hotwaterheating airconditioning parking prefarea furnishingstatus
1 13300000  7420        4         2       3      yes        no       no              no             yes       2      yes        furnished
2 12250000  8960        4         4       4      yes        no       no              no             yes       3       no        furnished
3 12250000  9960        3         2       2      yes        no      yes              no              no       2      yes   semi-furnished
5 11410000  7420        4         1       2      yes       yes      yes              no             yes       2       no        furnished
8 10150000 16200        5         3       2      yes        no       no              no              no       0       no      unfurnished
9  9870000  8100        4         1       2      yes       yes      yes              no             yes       2      yes        furnished
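
A split of this kind could be drawn along the following lines. This is a minimal sketch on a toy data frame, not the authors' code: the name `houses`, the seed, and the use of base R `sample()` are all assumptions, since the source does not show how the split was made (a stratified helper such as caret's `createDataPartition()` would be an equally plausible choice).

```r
# Toy stand-in for the full house-price data (hypothetical values).
set.seed(621)
houses <- data.frame(
  price = round(runif(100, 1.75e6, 13.3e6)),
  area  = round(runif(100, 1650, 16200))
)

# Draw 70% of the row indices at random for training.
n_train   <- round(0.7 * nrow(houses))
train_idx <- sample(seq_len(nrow(houses)), size = n_train)

train_df <- houses[train_idx, ]   # used to fit the models
eval_df  <- houses[-train_idx, ]  # held out for evaluation

nrow(train_df)
nrow(eval_df)
```

Because the rows are sampled rather than taken in order, the original row labels are preserved, which is why the labels in Table 1 skip values.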

Data Exploration:

After the split, our training data includes 386 records and 13 variables, whereas the evaluation data includes 159 records and 13 variables.

Training:

## 'data.frame':    386 obs. of  13 variables:
##  $ price           : int  13300000 12250000 12250000 11410000 10150000 9870000 9800000 9800000 9681000 9310000 ...
##  $ area            : int  7420 8960 9960 7420 16200 8100 5750 13200 6000 6550 ...
##  $ bedrooms        : int  4 4 3 4 5 4 3 3 4 4 ...
##  $ bathrooms       : int  2 4 2 1 3 1 2 1 3 2 ...
##  $ stories         : int  3 4 2 2 2 2 4 2 2 2 ...
##  $ mainroad        : chr  "yes" "yes" "yes" "yes" ...
##  $ guestroom       : chr  "no" "no" "no" "yes" ...
##  $ basement        : chr  "no" "no" "yes" "yes" ...
##  $ hotwaterheating : chr  "no" "no" "no" "no" ...
##  $ airconditioning : chr  "yes" "yes" "no" "yes" ...
##  $ parking         : int  2 3 2 2 0 2 1 2 2 1 ...
##  $ prefarea        : chr  "yes" "no" "yes" "no" ...
##  $ furnishingstatus: chr  "furnished" "furnished" "semi-furnished" "furnished" ...

Evaluation:

## 'data.frame':    159 obs. of  13 variables:
##  $ price           : int  12215000 10850000 10150000 9240000 9100000 8960000 8855000 8750000 8400000 8120000 ...
##  $ area            : int  7500 7500 8580 7800 6600 8500 6420 4320 7950 6840 ...
##  $ bedrooms        : int  4 3 4 3 4 3 3 3 5 5 ...
##  $ bathrooms       : int  2 3 3 2 2 2 2 1 2 1 ...
##  $ stories         : int  2 1 4 2 2 4 2 2 2 2 ...
##  $ mainroad        : chr  "yes" "yes" "yes" "yes" ...
##  $ guestroom       : chr  "no" "no" "no" "no" ...
##  $ basement        : chr  "yes" "yes" "no" "no" ...
##  $ hotwaterheating : chr  "no" "no" "no" "no" ...
##  $ airconditioning : chr  "yes" "yes" "yes" "no" ...
##  $ parking         : int  3 2 2 0 1 2 1 2 2 1 ...
##  $ prefarea        : chr  "yes" "yes" "yes" "yes" ...
##  $ furnishingstatus: chr  "furnished" "semi-furnished" "semi-furnished" "semi-furnished" ...

Using the summary() function, let's start exploring the training and evaluation data.

Training:

##      price               area          bedrooms       bathrooms   
##  Min.   : 1750000   Min.   : 1650   Min.   :1.000   Min.   :1.00  
##  1st Qu.: 3473750   1st Qu.: 3588   1st Qu.:2.000   1st Qu.:1.00  
##  Median : 4340000   Median : 4600   Median :3.000   Median :1.00  
##  Mean   : 4763635   Mean   : 5178   Mean   :2.953   Mean   :1.28  
##  3rd Qu.: 5740000   3rd Qu.: 6360   3rd Qu.:3.000   3rd Qu.:2.00  
##  Max.   :13300000   Max.   :16200   Max.   :6.000   Max.   :4.00  
##     stories        mainroad          guestroom           basement        
##  Min.   :1.000   Length:386         Length:386         Length:386        
##  1st Qu.:1.000   Class :character   Class :character   Class :character  
##  Median :2.000   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :1.793                                                           
##  3rd Qu.:2.000                                                           
##  Max.   :4.000                                                           
##  hotwaterheating    airconditioning       parking        prefarea        
##  Length:386         Length:386         Min.   :0.000   Length:386        
##  Class :character   Class :character   1st Qu.:0.000   Class :character  
##  Mode  :character   Mode  :character   Median :0.000   Mode  :character  
##                                        Mean   :0.715                     
##                                        3rd Qu.:1.000                     
##                                        Max.   :3.000                     
##  furnishingstatus  
##  Length:386        
##  Class :character  
##  Mode  :character  
##                    
##                    
## 

Evaluation:

##      price               area          bedrooms       bathrooms    
##  Min.   : 1767150   Min.   : 1836   Min.   :1.000   Min.   :1.000  
##  1st Qu.: 3430000   1st Qu.: 3600   1st Qu.:3.000   1st Qu.:1.000  
##  Median : 4270000   Median : 4500   Median :3.000   Median :1.000  
##  Mean   : 4774240   Mean   : 5083   Mean   :2.994   Mean   :1.302  
##  3rd Qu.: 5771500   3rd Qu.: 6450   3rd Qu.:3.000   3rd Qu.:2.000  
##  Max.   :12215000   Max.   :12944   Max.   :5.000   Max.   :3.000  
##     stories        mainroad          guestroom           basement        
##  Min.   :1.000   Length:159         Length:159         Length:159        
##  1st Qu.:1.000   Class :character   Class :character   Class :character  
##  Median :2.000   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :1.836                                                           
##  3rd Qu.:2.000                                                           
##  Max.   :4.000                                                           
##  hotwaterheating    airconditioning       parking         prefarea        
##  Length:159         Length:159         Min.   :0.0000   Length:159        
##  Class :character   Class :character   1st Qu.:0.0000   Class :character  
##  Mode  :character   Mode  :character   Median :0.0000   Mode  :character  
##                                        Mean   :0.6415                     
##                                        3rd Qu.:1.0000                     
##                                        Max.   :3.0000                     
##  furnishingstatus  
##  Length:159        
##  Class :character  
##  Mode  :character  
##                    
##                    
## 

Price:

It is important to recognize that every home in this dataset is priced above 1 million (the training minimum is 1,750,000). It is not clear that this is a US dataset; if it were, these prices would indicate luxury homes and/or high-value markets.

Area:

The area variable appears to be square footage of the home. We would traditionally expect that increases in area would lead to increases in price.

Bedrooms:

While we expect increases in the number of bedrooms to increase the price, we also realize that at some point there are diminishing returns, where an additional bedroom doesn’t have as much of an impact. For example, increasing from one to two bedrooms should produce a significant increase in price, while increasing from four to five, perhaps not so much.

##   bedrooms   n
## 1        1   1
## 2        2 102
## 3        3 207
## 4        4  68
## 5        5   6
## 6        6   2

Based on the distribution of the number of bedrooms, it may be best to categorize these with dummy variables: 2, 3, and 4+.

Bathrooms:

Similar to the number of bedrooms, we would expect an increase in bathroom count to lead to an increase in price, though with diminishing returns: having more than four bathrooms is likely to lead to smaller increases.

##   bathrooms   n
## 1         1 288
## 2         2  89
## 3         3   8
## 4         4   1

Based on the distribution of the number of bathrooms, it may be best to categorize these with dummy variables: 2 and 3+.

Stories:

Similar to the number of bedrooms and bathrooms, it would seem to make sense to group homes with 3 or more floors together by introducing dummy variables: 2 and 3+.

##   stories   n
## 1       1 169
## 2       2 161
## 3       3  23
## 4       4  33

Parking:

We are assuming that the parking variable represents the size of a garage. As with the other variables, the increase in price from no garage to a one-car garage would be significant, while additional spaces would add less value. It would initially seem to make sense to introduce dummy variables: 1 and 2+.

##   parking   n
## 1       0 203
## 2       1  97
## 3       2  79
## 4       3   7

Furnishing Status:

The furnishing status variable takes on three values: unfurnished, semi-furnished, and furnished. Since we consider unfurnished the default state, we will use dummy variables for semi-furnished and furnished.

##   furnishingstatus   n
## 1        furnished 103
## 2   semi-furnished 160
## 3      unfurnished 123

Main Road:

The main road variable is yes/no based on the street of the home. We will replace this with a dummy variable.

##   mainroad   n
## 1       no  50
## 2      yes 336

Guest Room:

The guest room variable is yes/no based on the home having a guest room. It is unclear from the dataset source if this is in addition to the number of bedrooms, but we would expect houses with a guest room to have a higher price. We will replace this with a dummy variable.

##   guestroom   n
## 1        no 312
## 2       yes  74

Basement:

The basement variable is yes/no based on the home having a basement. It is unclear if having a basement or not would lead to an increase in home price, but we will replace this with a dummy variable for analysis.

##   basement   n
## 1       no 249
## 2      yes 137

Hot Water Heating:

Based on the distribution, we assume that the hot water heating variable indicates whether the house has in-floor heating rather than forced air. Under this assumption, we would expect the feature to lead to a higher house price. The variable will be replaced with a dummy variable for analysis.

##   hotwaterheating   n
## 1              no 366
## 2             yes  20

Air Conditioning:

The air conditioning variable indicates if the house has central air conditioning. We would expect homes with air conditioning to have a higher price than those without. The variable will be replaced with a dummy variable.

##   airconditioning   n
## 1              no 264
## 2             yes 122

Preferential Area:

The dataset source doesn’t specify exactly what this variable represents. We are assuming that it is a yes/no value indicating whether the house is in a preferred neighborhood. We would expect houses with a yes to have a higher price than those without.

##   prefarea   n
## 1       no 298
## 2      yes  88

Data Preparation:

Based on our exploration, we do not have any blank values in our dataset.

Clean Function:

We will introduce a clean function to replace our categorical variables with the dummy values. This will also ensure that our test and train datasets are processed in the same way.
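
A clean function of this kind might look like the sketch below. The output column names follow the coefficient names that appear in the model summaries later in this report; the function name `clean_houses` and the exact recoding rules are assumptions, since the authors' own function is not shown.

```r
# Hypothetical helper: recode the raw columns into 0/1 dummy variables.
# Output names mirror the coefficient names in the model output; the
# recoding rules themselves are an assumption.
clean_houses <- function(df) {
  yes_no <- function(x) as.integer(x == "yes")
  data.frame(
    price           = df$price,
    area            = df$area,
    mainroad        = yes_no(df$mainroad),
    guestroom       = yes_no(df$guestroom),
    basement        = yes_no(df$basement),
    hotwaterheating = yes_no(df$hotwaterheating),
    bed2            = as.integer(df$bedrooms == 2),
    bed3            = as.integer(df$bedrooms == 3),
    bed4            = as.integer(df$bedrooms == 4),
    bed5            = as.integer(df$bedrooms == 5),
    bed6plus        = as.integer(df$bedrooms >= 6),
    bath2           = as.integer(df$bathrooms == 2),
    bath3           = as.integer(df$bathrooms == 3),
    bath4plus       = as.integer(df$bathrooms >= 4),
    floor2          = as.integer(df$stories == 2),
    floor3          = as.integer(df$stories == 3),
    floor4plus      = as.integer(df$stories >= 4),
    car1            = as.integer(df$parking == 1),
    car2            = as.integer(df$parking == 2),
    car3plus        = as.integer(df$parking >= 3),
    semifurnished   = as.integer(df$furnishingstatus == "semi-furnished"),
    furnished       = as.integer(df$furnishingstatus == "furnished"),
    ac              = yes_no(df$airconditioning),
    neighborhood    = yes_no(df$prefarea)
  )
}

# The first two rows of Table 1, typed in by hand.
raw <- data.frame(
  price = c(13300000, 12250000), area = c(7420, 8960),
  bedrooms = c(4, 4), bathrooms = c(2, 4), stories = c(3, 4),
  mainroad = c("yes", "yes"), guestroom = c("no", "no"),
  basement = c("no", "no"), hotwaterheating = c("no", "no"),
  airconditioning = c("yes", "yes"), parking = c(2, 3),
  prefarea = c("yes", "no"),
  furnishingstatus = c("furnished", "furnished")
)
cleaned <- clean_houses(raw)
cleaned[2, ]
```

Applying the same function to both the training and evaluation sets keeps the two encoded identically. With one-bedroom, one-bathroom, single-story, no-parking, and unfurnished homes as the implicit baselines, no separate dummy is needed for those levels.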

Data Visualization:

Visual evaluation:

Correlation Plot:

After cleaning the dataset, a correlation plot lets us check our initial expectations for the variables.

The correlation plot generally confirms our initial expectations for the data.
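
For reference, a correlation check of this kind can be sketched as follows. The toy data, its column values, and the plot options are assumptions; only `cor()` from base R is required, and the `corrplot()` call runs only if that package is installed.

```r
# Toy cleaned data (hypothetical values); in the report this would be the
# cleaned training set.
set.seed(621)
n    <- 200
area <- runif(n, 1650, 16200)
ac   <- rbinom(n, 1, 0.3)
toy  <- data.frame(
  price = 1.5e6 + 260 * area + 7.5e5 * ac + rnorm(n, sd = 1e6),
  area  = area,
  ac    = ac
)

# Pairwise Pearson correlations between all columns.
cor_mat <- cor(toy)

# Correlation of each predictor with price, strongest first.
sort(cor_mat["price", colnames(cor_mat) != "price"], decreasing = TRUE)

# Draw the plot only if the corrplot package is available.
if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot(cor_mat, method = "color", type = "upper")
}
```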

Model Building:

Multiple Linear Regression Models

MLR Model 1

## 
## Call:
## lm(formula = price ~ ., data = model_lin_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2778242  -618889   -69359   502507  5058478 
## 
## Coefficients:
##                   Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)     1328656.61 1089042.62   1.220             0.223251    
## area                259.46      29.19   8.889 < 0.0000000000000002 ***
## mainroad         392612.63  177649.27   2.210             0.027727 *  
## guestroom        387128.55  154049.10   2.513             0.012404 *  
## basement         340379.30  132393.80   2.571             0.010540 *  
## hotwaterheating  454981.22  254142.03   1.790             0.074247 .  
## bed2            -135521.62 1079765.76  -0.126             0.900189    
## bed3              84970.58 1084129.57   0.078             0.937572    
## bed4             294373.83 1095943.45   0.269             0.788388    
## bed5             349023.44 1186154.54   0.294             0.768737    
## bed6plus         822411.56 1320048.36   0.623             0.533666    
## bath2            823033.31  148991.77   5.524         0.0000000634 ***
## bath3           1711486.52  412100.64   4.153         0.0000409553 ***
## bath4plus       5939390.79 1173229.90   5.062         0.0000006603 ***
## floor2           369286.09  145200.74   2.543             0.011397 *  
## floor3           917915.07  262415.57   3.498             0.000527 ***
## floor4plus      1368891.57  247650.70   5.528         0.0000000623 ***
## car1             350724.78  139562.29   2.513             0.012404 *  
## car2             597602.32  154287.27   3.873             0.000127 ***
## car3plus        -694646.46  454023.80  -1.530             0.126896    
## semifurnished    386745.44  133594.69   2.895             0.004023 ** 
## furnished        533608.97  151214.70   3.529             0.000471 ***
## ac               762389.97  133720.79   5.701         0.0000000247 ***
## neighborhood     666169.08  141563.01   4.706         0.0000036049 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1066000 on 362 degrees of freedom
## Multiple R-squared:  0.6949, Adjusted R-squared:  0.6755 
## F-statistic: 35.85 on 23 and 362 DF,  p-value: < 0.00000000000000022

MLR Model 2

## 
## Call:
## lm(formula = price ~ area + mainroad + guestroom + basement + 
##     hotwaterheating + bed2 + bath2 + bath3 + bath4plus + floor2 + 
##     floor3 + floor4plus + car1 + car2 + car3plus + semifurnished + 
##     furnished + ac + neighborhood, data = model_lin_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2606156  -623517   -76743   477682  5170369 
## 
## Coefficients:
##                   Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)     1427060.25  221172.14   6.452        0.00000000035 ***
## area                262.55      28.72   9.142 < 0.0000000000000002 ***
## mainroad         371018.49  174144.11   2.131             0.033795 *  
## guestroom        388933.37  153592.89   2.532             0.011751 *  
## basement         327059.67  131426.10   2.489             0.013271 *  
## hotwaterheating  440059.27  252885.91   1.740             0.082673 .  
## bed2            -242263.06  151828.03  -1.596             0.111432    
## bath2            871714.49  145028.40   6.011        0.00000000447 ***
## bath3           1782594.89  398633.12   4.472        0.00001036418 ***
## bath4plus       6100918.97 1164426.07   5.239        0.00000027257 ***
## floor2           427081.88  139028.59   3.072             0.002286 ** 
## floor3           944710.16  259836.19   3.636             0.000317 ***
## floor4plus      1394855.82  245550.02   5.681        0.00000002741 ***
## car1             354677.24  138326.44   2.564             0.010744 *  
## car2             606621.41  153181.86   3.960        0.00009002748 ***
## car3plus        -694809.66  452826.31  -1.534             0.125799    
## semifurnished    390379.23  132434.98   2.948             0.003407 ** 
## furnished        540998.79  150081.46   3.605             0.000356 ***
## ac               757477.86  133121.53   5.690        0.00000002604 ***
## neighborhood     661882.68  141156.08   4.689        0.00000388102 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1064000 on 366 degrees of freedom
## Multiple R-squared:  0.6927, Adjusted R-squared:  0.6767 
## F-statistic: 43.42 on 19 and 366 DF,  p-value: < 0.00000000000000022

MLR Model 3

## 
## Call:
## lm(formula = price ~ area + guestroom + basement + bath2 + bath3 + 
##     bath4plus + floor2 + floor3 + floor4plus + car1 + car2 + 
##     car3plus + semifurnished + furnished + ac + neighborhood - 
##     car3plus - bed2, data = model_lin_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2649096  -673238   -43507   477530  5001125 
## 
## Coefficients:
##                 Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)   1503879.37  170520.37   8.819 < 0.0000000000000002 ***
## area              270.76      27.87   9.715 < 0.0000000000000002 ***
## guestroom      408948.95  154956.82   2.639             0.008664 ** 
## basement       355323.88  131895.11   2.694             0.007382 ** 
## bath2          890196.07  145115.53   6.134        0.00000000220 ***
## bath3         1780896.41  401261.08   4.438        0.00001198413 ***
## bath4plus     5494328.88 1105325.41   4.971        0.00000102187 ***
## floor2         550003.46  123318.63   4.460        0.00001088514 ***
## floor3        1161098.75  249130.18   4.661        0.00000440677 ***
## floor4plus    1495717.83  237291.27   6.303        0.00000000083 ***
## car1           395580.85  136995.37   2.888             0.004111 ** 
## car2           706158.82  151485.48   4.662        0.00000438759 ***
## semifurnished  421633.75  132631.94   3.179             0.001602 ** 
## furnished      569389.00  150773.07   3.776             0.000185 ***
## ac             760707.49  132467.98   5.743        0.00000001947 ***
## neighborhood   698432.62  139650.80   5.001        0.00000088145 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1074000 on 370 degrees of freedom
## Multiple R-squared:  0.6831, Adjusted R-squared:  0.6703 
## F-statistic: 53.18 on 15 and 370 DF,  p-value: < 0.00000000000000022

Model Selection:

Verify linear modeling assumptions:

## [1] "--------------------------------------------------"
## lm(formula = price ~ area + mainroad + guestroom + basement + 
##     hotwaterheating + bed2 + bath2 + bath3 + bath4plus + floor2 + 
##     floor3 + floor4plus + car1 + car2 + car3plus + semifurnished + 
##     furnished + ac + neighborhood, data = model_lin_train)

## [1] ""
## [1] "Shapiro test for normality: The p-value of 0.0000000000102740426604147 is <= 0.05, so reject the null; i.e., the residuals are NOT NORMAL"
## [1] ""
## [1] "Breusch-Pagan test for homoschedasticity: The p-value of 0.0000153934723145733 is <= 0.05 and the test statistic is 56.1639472213671, so reject the null; i.e., the residuals are HETEROSCHEDASTIC."
## [1] ""
## [1] "Variance inflation factor (VIF)"
## [1] "<=1: not correlated, 1-5: moderately correlated, >5: strongly correlated"
##      floor4plus          floor2            bed2       furnished   semifurnished 
##        1.607734        1.602737        1.528509        1.502877        1.451709 
##            area        basement              ac            car2          floor3 
##        1.372255        1.348742        1.306490        1.302644        1.290267 
##           bath2       guestroom        car3plus            car1    neighborhood 
##        1.272619        1.246735        1.245221        1.227795        1.196035 
##       bath4plus        mainroad           bath3 hotwaterheating 
##        1.194896        1.166199        1.099953        1.071534 
## [1] ""
## [1] "Model scores:"
## [1] "    adjusted R-squared: 0.677"
## [1] "    AIC: 11830.246"
## [1] "    BIC: 11913.319"
## [1] "    Mallow's Cp: 20"
## [1] "    mean squared error: 1073150824664.42"
## [1] ""
## [1] "Leverage point cutoff: 0.10880829015544"
## [1] ""
## [1] "First 10 points of influence:"
## [1] "    case #2: 1"
## [1] "    case #5: 0.219"
## [1] "    case #9: 0.184"
## [1] "    case #25: 0.155"
## [1] "    case #33: 0.209"
## [1] "    case #49: 0.128"
## [1] "    case #62: 0.149"
## [1] "    case #103: 0.133"
## [1] "    case #110: 0.15"
## [1] "    case #136: 0.149"
## [1] "    case #156: 0.201"
## [1] ""
## [1] "--------------------------------------------------"
## lm(formula = price ~ area + guestroom + basement + bath2 + bath3 + 
##     bath4plus + floor2 + floor3 + floor4plus + car1 + car2 + 
##     car3plus + semifurnished + furnished + ac + neighborhood - 
##     car3plus - bed2, data = model_lin_train)

## [1] ""
## [1] "Shapiro test for normality: The p-value of 0.0000000000133366117207675 is <= 0.05, so reject the null; i.e., the residuals are NOT NORMAL"
## [1] ""
## [1] "Breusch-Pagan test for homoschedasticity: The p-value of 0.000000924975310233364 is <= 0.05 and the test statistic is 56.6934694356472, so reject the null; i.e., the residuals are HETEROSCHEDASTIC."
## [1] ""
## [1] "Variance inflation factor (VIF)"
## [1] "<=1: not correlated, 1-5: moderately correlated, >5: strongly correlated"
##     furnished    floor4plus semifurnished      basement            ac 
##      1.487044      1.471990      1.427504      1.331772      1.268347 
##          area         bath2          car2     guestroom        floor2 
##      1.267011      1.249185      1.248993      1.244114      1.236285 
##          car1        floor3  neighborhood         bath3     bath4plus 
##      1.180685      1.162893      1.147727      1.092669      1.055586 
## [1] ""
## [1] "Model scores:"
## [1] "    adjusted R-squared: 0.67"
## [1] "    AIC: 11834.079"
## [1] "    BIC: 11901.328"
## [1] "    Mallow's Cp: 16"
## [1] "    mean squared error: 1106558754047.73"
## [1] ""
## [1] "Leverage point cutoff: 0.0880829015544041"
## [1] ""
## [1] "First 10 points of influence:"
## [1] "    case #2: 1"
## [1] "    case #5: 0.215"
## [1] "    case #9: 0.143"
## [1] "    case #25: 0.15"
## [1] "    case #62: 0.148"
## [1] "    case #110: 0.146"
## [1] "    case #136: 0.146"
## [1] "    case #210: 0.144"
## [1] "    case #232: 0.091"
## [1] "    case #355: 0.15"
## [1] ""
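
The two leverage cutoffs printed above are consistent with the rule 2(p + 2)/n, where p is the number of predictors and n = 386 training rows: 2(19 + 2)/386 ≈ 0.1088 for the 19-predictor model and 2(15 + 2)/386 ≈ 0.0881 for the 15-predictor model. The sketch below reproduces that arithmetic and runs comparable checks on a toy fit using base R only. The helper name `leverage_cutoff` and the toy data are assumptions; the function that produced the output above is not shown in the source.

```r
# Cutoff matching the values printed above: 2 * (p + 2) / n, with p
# predictors and n observations (the helper name is hypothetical).
leverage_cutoff <- function(n_predictors, n_obs) {
  2 * (n_predictors + 2) / n_obs
}

leverage_cutoff(19, 386)  # 19-predictor model
leverage_cutoff(15, 386)  # 15-predictor model

# Toy fit (hypothetical data) to show the same kinds of checks.
set.seed(621)
toy <- data.frame(area = runif(100, 1650, 16200))
toy$price <- 1.5e6 + 260 * toy$area + rnorm(100, sd = 1e6)
fit <- lm(price ~ area, data = toy)

shapiro.test(residuals(fit))$p.value  # normality of the residuals
AIC(fit)                              # information criteria
BIC(fit)

# Hat values above the cutoff flag high-leverage cases.
h <- hatvalues(fit)
which(h > leverage_cutoff(1, nrow(toy)))
```

With the lmtest and car packages loaded, `bptest(fit)` and `vif(fit)` supply the Breusch-Pagan test and the variance inflation factors; note that `vif()` needs a model with at least two predictors, so it would not run on this one-predictor toy fit.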

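Because the residuals are non-normal and heteroscedastic, we try a transform of the response, using Box-Cox to estimate which transform is appropriate. An estimated parameter near zero, like the roughly 0.08 reported below, points to a log transform, which is why the models that follow use log(price).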
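
A Box-Cox estimate can be obtained along these lines. This is a sketch on toy data, assuming `MASS::boxcox()` as the estimator because MASS is among the loaded libraries; the call that produced the estimate in this report is not shown, and the data and grid below are hypothetical.

```r
library(MASS)  # provides boxcox()

# Toy data with multiplicative errors (hypothetical values), so the log
# scale is the natural one for the response.
set.seed(621)
toy <- data.frame(area = runif(200, 1650, 16200))
toy$price <- exp(14.5 + 1e-4 * toy$area + rnorm(200, sd = 0.2))

# Keep y and the QR decomposition so boxcox() can reuse the fit directly.
fit <- lm(price ~ area, data = toy, y = TRUE, qr = TRUE)

# Profile the Box-Cox log-likelihood over a grid of lambda values and
# take the grid point that maximizes it.
bc <- boxcox(fit, lambda = seq(-2, 2, by = 0.05), plotit = FALSE)
lambda_hat <- bc$x[which.max(bc$y)]
lambda_hat

# A lambda near 0 corresponds to log(y), so refit on that scale.
fit_log <- lm(log(price) ~ area, data = toy)
summary(fit_log)$adj.r.squared
```

Rounding the estimate to a conventional value (0 for log, 0.5 for square root, 1 for no transform) keeps the fitted coefficients interpretable, which is the usual reason for preferring log(price) over the exact estimated power.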

## Estimated transformation parameter 
##         Y1 
## 0.08222126
## 
## Call:
## lm(formula = log(price) ~ ., data = model_lin_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.58768 -0.12578 -0.00157  0.13060  0.60404 
## 
## Coefficients:
##                     Estimate   Std. Error t value             Pr(>|t|)    
## (Intercept)     14.544518392  0.212111659  68.570 < 0.0000000000000002 ***
## area             0.000050911  0.000005685   8.956 < 0.0000000000000002 ***
## mainroad         0.106432736  0.034600557   3.076             0.002257 ** 
## guestroom        0.072860681  0.030003978   2.428             0.015653 *  
## basement         0.092487862  0.025786197   3.587             0.000381 ***
## hotwaterheating  0.085236067  0.049498970   1.722             0.085928 .  
## bed2            -0.017549893  0.210304814  -0.083             0.933540    
## bed3             0.058477567  0.211154749   0.277             0.781983    
## bed4             0.083540318  0.213455726   0.391             0.695754    
## bed5             0.091411039  0.231026043   0.396             0.692578    
## bed6plus         0.256317014  0.257104398   0.997             0.319461    
## bath2            0.150431917  0.029018967   5.184        0.00000036205 ***
## bath3            0.302837566  0.080264398   3.773             0.000188 ***
## bath4plus        0.684383498  0.228508725   2.995             0.002933 ** 
## floor2           0.057366590  0.028280592   2.028             0.043243 *  
## floor3           0.196068831  0.051110399   3.836             0.000147 ***
## floor4plus       0.264755508  0.048234660   5.489        0.00000007621 ***
## car1             0.071065993  0.027182396   2.614             0.009311 ** 
## car2             0.092977637  0.030050365   3.094             0.002128 ** 
## car3plus        -0.109608169  0.088429727  -1.239             0.215965    
## semifurnished    0.144544389  0.026020095   5.555        0.00000005383 ***
## furnished        0.136319806  0.029451924   4.629        0.00000513855 ***
## ac               0.154534113  0.026044654   5.933        0.00000000694 ***
## neighborhood     0.127050640  0.027572075   4.608        0.00000564357 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2076 on 362 degrees of freedom
## Multiple R-squared:  0.7091, Adjusted R-squared:  0.6906 
## F-statistic: 38.37 on 23 and 362 DF,  p-value: < 0.00000000000000022
## 
## Call:
## lm(formula = log(price) ~ area + mainroad + guestroom + basement + 
##     hotwaterheating + bed3 + bed4 + bed6plus + bath2 + bath3 + 
##     bath4plus + floor2 + floor3 + floor4plus + car1 + car2 + 
##     semifurnished + furnished + ac + neighborhood, data = model_lin_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.58509 -0.12700  0.00047  0.12936  0.60034 
## 
## Coefficients:
##                     Estimate   Std. Error t value             Pr(>|t|)    
## (Intercept)     14.535084196  0.040152865 361.994 < 0.0000000000000002 ***
## area             0.000050982  0.000005498   9.273 < 0.0000000000000002 ***
## mainroad         0.098425270  0.033966231   2.898             0.003985 ** 
## guestroom        0.073094605  0.029965558   2.439             0.015192 *  
## basement         0.094260108  0.025696821   3.668             0.000281 ***
## hotwaterheating  0.088006202  0.049413883   1.781             0.075744 .  
## bed3             0.064144656  0.028343442   2.263             0.024215 *  
## bed4             0.085823272  0.039599484   2.167             0.030859 *  
## bed6plus         0.263077810  0.150898238   1.743             0.082104 .  
## bath2            0.153013605  0.028971974   5.281        0.00000022073 ***
## bath3            0.326495664  0.077959203   4.188        0.00003531575 ***
## bath4plus        0.586803921  0.215060901   2.729             0.006669 ** 
## floor2           0.066285069  0.027306452   2.427             0.015688 *  
## floor3           0.208868710  0.050236039   4.158        0.00004009382 ***
## floor4plus       0.265805557  0.047723404   5.570        0.00000004958 ***
## car1             0.071743162  0.026735578   2.683             0.007619 ** 
## car2             0.097614195  0.029751245   3.281             0.001134 ** 
## semifurnished    0.142899486  0.025928567   5.511        0.00000006745 ***
## furnished        0.136023794  0.029314267   4.640        0.00000485991 ***
## ac               0.156272867  0.026009414   6.008        0.00000000454 ***
## neighborhood     0.128037921  0.027532296   4.650        0.00000463690 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2076 on 365 degrees of freedom
## Multiple R-squared:  0.7068, Adjusted R-squared:  0.6907 
## F-statistic: 43.99 on 20 and 365 DF,  p-value: < 0.00000000000000022
## [1] "--------------------------------------------------"
## lm(formula = log(price) ~ area + mainroad + guestroom + basement + 
##     hotwaterheating + bed3 + bed4 + bed6plus + bath2 + bath3 + 
##     bath4plus + floor2 + floor3 + floor4plus + car1 + car2 + 
##     semifurnished + furnished + ac + neighborhood, data = model_lin_train)
## Warning: not plotting observations with leverage one:
##   2

## [1] ""
## [1] "Shapiro test for normality: The p-value of 0.0274288341817099 is <= 0.05, so reject the null; i.e., the residuals are NOT NORMAL"
## [1] ""
## [1] "Breusch-Pagan test for homoschedasticity: The p-value of 0.221006362177854 and test statistic of 24.5055362047153 are inconclusive, so homoschedasticity can't be determined using this test. But since the p-value is > 0.05, it is reasonable to conclude that the residuals are HOMOSCHEDASTIC."
## [1] ""
## [1] "Variance inflation factor (VIF)"
## [1] "<=1: not correlated, 1-5: moderately correlated, >5: strongly correlated"
##            bed4            bed3          floor2      floor4plus       furnished 
##        2.039063        1.789964        1.624256        1.595391        1.506251 
##   semifurnished        basement           bath2            area              ac 
##        1.461844        1.354548        1.334192        1.321166        1.310209 
##            car2          floor3       guestroom            car1    neighborhood 
##        1.290896        1.267012        1.246656        1.204937        1.195362 
##        mainroad           bath3 hotwaterheating       bath4plus        bed6plus 
##        1.165520        1.105178        1.074793        1.070778        1.051587 
## [1] ""
## [1] "Model scores:"
## [1] "    adjusted R-squared: 0.691"
## [1] "    AIC: -96.006"
## [1] "    BIC: -8.978"
## [1] "    Mallow's Cp: 21"
## [1] "    mean squared error: 0.041"
## [1] ""
## [1] "Leverage point cutoff: 0.113989637305699"
## [1] ""
## [1] "First 10 points of influence:"
## [1] "    case #2: 1"
## [1] "    case #5: 0.232"
## [1] "    case #9: 0.193"
## [1] "    case #25: 0.159"
## [1] "    case #49: 0.124"
## [1] "    case #62: 0.163"
## [1] "    case #77: 0.515"
## [1] "    case #103: 0.153"
## [1] "    case #110: 0.155"
## [1] "    case #136: 0.157"
## [1] "    case #210: 0.165"
## [1] ""

Investigate outliers.
Table 2. Outliers

       price  area mainroad guestroom basement hotwaterheating bed2 bed3 bed4 bed5 bed6plus bath2 bath3 bath4plus floor2 floor3 floor4plus car1 car2 car3plus semifurnished furnished ac neighborhood
  2 12250000  8960        1         0        0               0    0    0    1    0        0     0     0         1      0      0          1    0    0        1             0         1  1            0
  8 10150000 16200        1         0        0               0    0    0    0    1        0     0     1         0      1      0          0    0    0        0             0         0  0            0
 12  9681000  6000        1         1        1               1    0    0    1    0        0     0     1         0      1      0          0    0    1        0             1         0  0            0
 34  8190000  5960        1         1        1               0    0    1    0    0        0     0     1         0      1      0          0    1    0        0             0         0  0            0
 67  6930000 13200        1         0        1               1    1    0    0    0        0     0     0         0      0      0          0    1    0        0             0         1  0            0
 90  6440000  8580        1         0        0               0    0    0    0    1        0     0     1         0      1      0          0    0    1        0             0         1  0            0
113  6083000  4300        1         0        0               0    0    0    0    0        1     1     0         0      1      0          0    0    0        0             0         1  0            0
144  5600000  4800        0         0        1               1    0    0    0    1        0     1     0         0      0      1          0    0    0        0             0         0  0            0
154  5530000  3300        1         0        1               0    0    1    0    0        0     0     1         0      1      0          0    0    0        0             1         0  0            0
196  4970000  4410        1         0        1               0    0    0    1    0        0     0     1         0      1      0          0    0    1        0             1         0  0            0
291  4200000  2610        0         0        0               0    0    0    1    0        0     0     1         0      1      0          0    0    0        0             1         0  0            0

Investigation of the outliers reveals no obvious pattern, so we have to assume there is some other variable at play that we don’t have data for (e.g., high-end appliances, presence of a pool, property condition, etc.). We’ll remove the outliers and re-run the model.
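
One side effect of dropping rows is worth illustrating. The training data contain a single four-bathroom home (case 2), which is why that case has leverage 1; once it is removed, the bath4plus column is all zeros and its coefficient can no longer be estimated. The sketch below reproduces this behavior on toy data. All names and values in it are hypothetical, and the cutoff uses the same 2(p + 2)/n form that matches the cutoffs printed in the diagnostics above.

```r
# Toy data (hypothetical): `rare` is a dummy equal to 1 for a single row,
# like a category represented by only one house.
set.seed(621)
toy <- data.frame(area = runif(50, 1650, 16200), rare = 0)
toy$rare[2] <- 1
toy$log_price <- 14.5 + 5e-5 * toy$area + 0.6 * toy$rare +
  rnorm(50, sd = 0.2)

fit <- lm(log_price ~ area + rare, data = toy)

# The lone rare == 1 row is fitted exactly, so its hat value is 1.
h <- hatvalues(fit)
h[2]

# Flag rows above the cutoff 2 * (p + 2) / n, here with p = 2 predictors.
drop_rows <- which(h > 2 * (2 + 2) / nrow(toy))

# Refit without the flagged rows, using negative indexing.
refit <- lm(formula(fit), data = toy[-drop_rows, ])

# `rare` is now constant, so its coefficient is reported as NA.
coef(refit)
```

An NA coefficient of this kind is not an error in the refit; it signals that the term should be dropped from the formula, since the remaining data carry no information about it.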

## 
## Call:
## lm(formula = formula(lm_mod5), data = model_lin_train[c(-2, -5, 
##     -9, -25, -49, -62, -77, -103, -110, -136, -210), ])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.58650 -0.12611 -0.00086  0.12165  0.62001 
## 
## Coefficients: (1 not defined because of singularities)
##                     Estimate   Std. Error t value             Pr(>|t|)    
## (Intercept)     14.552077340  0.041028854 354.679 < 0.0000000000000002 ***
## area             0.000046391  0.000005973   7.767   0.0000000000000873 ***
## mainroad         0.094132064  0.034584153   2.722             0.006812 ** 
## guestroom        0.078800876  0.030481181   2.585             0.010130 *  
## basement         0.085255655  0.026311384   3.240             0.001307 ** 
## hotwaterheating  0.057759412  0.052990633   1.090             0.276456    
## bed3             0.078248863  0.029177990   2.682             0.007665 ** 
## bed4             0.102622190  0.040833608   2.513             0.012407 *  
## bed6plus         0.120709432  0.210035997   0.575             0.565853    
## bath2            0.147071512  0.029194002   5.038   0.0000007515098156 ***
## bath3           -0.142253973  0.212242318  -0.670             0.503139    
## bath4plus                 NA           NA      NA                   NA    
## floor2           0.058563926  0.027551773   2.126             0.034227 *  
## floor3           0.191284890  0.051435131   3.719             0.000232 ***
## floor4plus       0.260516774  0.047628306   5.470   0.0000000851899506 ***
## car1             0.075782849  0.027081148   2.798             0.005417 ** 
## car2             0.104147967  0.030074398   3.463             0.000599 ***
## semifurnished    0.150000199  0.026299921   5.703   0.0000000247524034 ***
## furnished        0.140301991  0.029667427   4.729   0.0000032579008657 ***
## ac               0.159133762  0.025897699   6.145   0.0000000021516006 ***
## neighborhood     0.132549803  0.027450937   4.829   0.0000020472915666 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2063 on 355 degrees of freedom
## Multiple R-squared:  0.6997, Adjusted R-squared:  0.6836 
## F-statistic: 43.53 on 19 and 355 DF,  p-value: < 0.00000000000000022
## 
## Call:
## lm(formula = log(price) ~ area + mainroad + guestroom + basement + 
##     bed3 + bed4 + bath2 + floor2 + floor3 + floor4plus + car1 + 
##     car2 + semifurnished + furnished + ac + neighborhood, data = model_lin_train[c(-2, 
##     -5, -9, -25, -49, -62, -77, -103, -110, -136, -210), ])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.58873 -0.12342  0.00127  0.12844  0.67130 
## 
## Coefficients:
##                   Estimate   Std. Error t value             Pr(>|t|)    
## (Intercept)   14.552303788  0.040688236 357.654 < 0.0000000000000002 ***
## area           0.000045884  0.000005951   7.710    0.000000000000125 ***
## mainroad       0.097414780  0.034245452   2.845             0.004702 ** 
## guestroom      0.076896113  0.030070499   2.557             0.010964 *  
## basement       0.085065566  0.026142098   3.254             0.001246 ** 
## bed3           0.078025035  0.028949410   2.695             0.007366 ** 
## bed4           0.100743111  0.040624077   2.480             0.013602 *  
## bath2          0.149409677  0.029079309   5.138    0.000000457356246 ***
## floor2         0.059460729  0.027314842   2.177             0.030143 *  
## floor3         0.195417664  0.051220442   3.815             0.000160 ***
## floor4plus     0.259324130  0.047494588   5.460    0.000000089151967 ***
## car1           0.081547827  0.026701058   3.054             0.002426 ** 
## car2           0.106210924  0.029971879   3.544             0.000447 ***
## semifurnished  0.152667330  0.026031467   5.865    0.000000010235661 ***
## furnished      0.141615566  0.029508058   4.799    0.000002343150712 ***
## ac             0.155994408  0.025661908   6.079    0.000000003102744 ***
## neighborhood   0.130735463  0.027358190   4.779    0.000002579879496 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.206 on 358 degrees of freedom
## Multiple R-squared:  0.698,  Adjusted R-squared:  0.6845 
## F-statistic: 51.71 on 16 and 358 DF,  p-value: < 0.00000000000000022
## [1] "--------------------------------------------------"
## lm(formula = log(price) ~ area + mainroad + guestroom + basement + 
##     bed3 + bed4 + bath2 + floor2 + floor3 + floor4plus + car1 + 
##     car2 + semifurnished + furnished + ac + neighborhood, data = model_lin_train[c(-2, 
##     -5, -9, -25, -49, -62, -77, -103, -110, -136, -210), ])

## [1] ""
## [1] "Shapiro test for normality: The p-value of 0.0135819482901162 is <= 0.05, so reject the null; i.e., the residuals are NOT NORMAL"
## [1] ""
## [1] "Breusch-Pagan test for homoschedasticity: The p-value of 0.0840661091923266 and test statistic of 24.2557228274724 are inconclusive, so homoschedasticity can't be determined using this test. But since the p-value is > 0.05, it is reasonable to conclude that the residuals are HOMOSCHEDASTIC."
## [1] ""
## [1] "Variance inflation factor (VIF)"
## [1] "<=1: not correlated, 1-5: moderately correlated, >5: strongly correlated"
##          bed4          bed3        floor2    floor4plus     furnished 
##      2.063505      1.834765      1.591989      1.555359      1.494596 
## semifurnished          area      basement         bath2          car2 
##      1.454334      1.382581      1.372272      1.331000      1.282363 
##        floor3            ac     guestroom          car1  neighborhood 
##      1.279918      1.271438      1.239238      1.191343      1.187511 
##      mainroad 
##      1.156360 
## [1] ""
## [1] "Model scores:"
## [1] "    adjusted R-squared: 0.684"
## [1] "    AIC: -101.985"
## [1] "    BIC: -31.301"
## [1] "    Mallow's Cp: 17"
## [1] "    mean squared error: 0.041"
## [1] ""
## [1] "Leverage point cutoff: 0.096"
## [1] ""
## [1] "First 10 points of influence:"
## [1] "    case #48: 0.104"
## [1] "    case #62: 0.102"
## [1] "    case #80: 0.099"
## [1] "    case #84: 0.098"
## [1] "    case #221: 0.116"
## [1] ""
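
The diagnostics printed above can be approximated in base R. The sketch below is not the report's helper function (its source is not shown); it runs on synthetic data with hypothetical names. The Breusch-Pagan statistic is computed by hand in its studentized (Koenker) form, which to our understanding is the default variant in `lmtest::bptest()`. The leverage cutoff of 2(p + 1)/n, with p the number of estimated coefficients, is inferred from the printed cutoffs (for example 2 × 18 / 386 ≈ 0.0933) rather than taken from the report's code.

```r
set.seed(621)
n <- 200
toy <- data.frame(area = runif(n, 2000, 12000),
                  ac   = rbinom(n, 1, 0.3))
toy$log_price <- 14.5 + 0.00005 * toy$area + 0.15 * toy$ac + rnorm(n, sd = 0.2)
fit <- lm(log_price ~ area + ac, data = toy)

# Shapiro-Wilk: the null hypothesis is that the residuals are normal
sw_p <- shapiro.test(residuals(fit))$p.value

# Studentized Breusch-Pagan by hand: regress the squared residuals on the
# predictors, then compare n * R^2 with a chi-square on (#predictors) df
toy$res2 <- residuals(fit)^2
aux      <- lm(res2 ~ area + ac, data = toy)
bp_stat  <- n * summary(aux)$r.squared
bp_df    <- length(coef(aux)) - 1
bp_p     <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

# Leverage: flag hat values above 2 * (p + 1) / n (assumed cutoff rule)
p      <- length(coef(fit))
cutoff <- 2 * (p + 1) / n
high   <- which(hatvalues(fit) > cutoff)

c(shapiro_p = sw_p, bp_p = bp_p, leverage_cutoff = cutoff)
```

A Breusch-Pagan p-value above 0.05 means the test fails to reject constant variance; it does not prove homoscedasticity, so residual plots remain a useful complement.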

\(~\)

The residuals are still not normally distributed (Shapiro test p-value of 0.014), so we use robust regression to try to address the non-normality.
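
The Call in the output below passes `lm_mod8$w` as weights to `lm()`. One way to obtain such weights, sketched here under the assumption that the robust fit came from `MASS::rlm()` with its default Huber psi function (the fitting code is not shown in this report), is the following. The data are synthetic and the names `toy`, `rob` and `wfit` are hypothetical.

```r
library(MASS)  # provides rlm() and psi.huber

set.seed(621)
n <- 200
toy <- data.frame(area = runif(n, 2000, 12000))
# Heavy-tailed errors (t with 3 df) so that some points get down-weighted
toy$price <- exp(14.5 + 0.00005 * toy$area + 0.2 * rt(n, df = 3))

# Huber M-estimation; psi.huber is rlm()'s default psi function
rob <- rlm(log(price) ~ area, data = toy, psi = psi.huber)

# rob$w holds the final iteratively reweighted least squares weights:
# 1 for well-fit points, below 1 for points with large residuals
summary(rob$w)

# Refit by weighted least squares using those weights
wfit <- lm(log(price) ~ area, data = toy, weights = rob$w)
rbind(rlm = coef(rob), weighted_lm = coef(wfit))
```

The weighted `lm()` reproduces the robust coefficients while also exposing the usual `summary()` statistics, which is presumably why an adjusted R-squared is available for the weighted models but not for the robust fit itself.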

## 
## Call:
## lm(formula = formula(lm_mod7), data = model_lin_train, weights = lm_mod8$w)
## 
## Weighted Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.39873 -0.13163  0.00627  0.12850  0.41563 
## 
## Coefficients:
##                   Estimate   Std. Error t value             Pr(>|t|)    
## (Intercept)   14.576837195  0.036640378 397.835 < 0.0000000000000002 ***
## area           0.000049910  0.000005072   9.840 < 0.0000000000000002 ***
## mainroad       0.084162100  0.030798460   2.733             0.006585 ** 
## guestroom      0.066825026  0.027117946   2.464             0.014186 *  
## basement       0.092910569  0.023534417   3.948     0.00009443484523 ***
## bed3           0.047228877  0.025581852   1.846             0.065666 .  
## bed4           0.071650779  0.035800323   2.001             0.046081 *  
## bath2          0.145767340  0.026146814   5.575     0.00000004789330 ***
## floor2         0.087224376  0.024366301   3.580             0.000390 ***
## floor3         0.228322844  0.045009601   5.073     0.00000062234166 ***
## floor4plus     0.304215596  0.042847379   7.100     0.00000000000645 ***
## car1           0.061013969  0.024105545   2.531             0.011785 *  
## car2           0.093803011  0.027068547   3.465             0.000592 ***
## semifurnished  0.136384089  0.023452238   5.815     0.00000001311493 ***
## furnished      0.139402409  0.026699824   5.221     0.00000029770305 ***
## ac             0.142091507  0.023231442   6.116     0.00000000244205 ***
## neighborhood   0.127632227  0.024725726   5.162     0.00000040033935 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1835 on 369 degrees of freedom
## Multiple R-squared:  0.7219, Adjusted R-squared:  0.7098 
## F-statistic: 59.86 on 16 and 369 DF,  p-value: < 0.00000000000000022
## [1] "--------------------------------------------------"
## lm(formula = formula(lm_mod7), data = model_lin_train, weights = lm_mod8$w)

## [1] ""
## [1] "Shapiro test for normality: The p-value of 0.00149841137153798 is <= 0.05, so reject the null; i.e., the residuals are NOT NORMAL"
## [1] ""
## [1] "Breusch-Pagan test for homoschedasticity: The p-value of 0.726046019909312 and test statistic of 12.2578680885363 are inconclusive, so homoschedasticity can't be determined using this test. But since the p-value is > 0.05, it is reasonable to conclude that the residuals are HOMOSCHEDASTIC."
## [1] ""
## [1] "Variance inflation factor (VIF)"
## [1] "<=1: not correlated, 1-5: moderately correlated, >5: strongly correlated"
##          bed4          bed3    floor4plus        floor2     furnished 
##      1.974297      1.763977      1.576573      1.555003      1.497120 
## semifurnished      basement         bath2          area          car2 
##      1.454589      1.369392      1.321515      1.307867      1.268015 
##            ac        floor3     guestroom  neighborhood          car1 
##      1.266215      1.243657      1.227593      1.190301      1.183226 
##      mainroad 
##      1.165607 
## [1] ""
## [1] "Model scores:"
## [1] "    adjusted R-squared: 0.71"
## [1] "    AIC: -167.558"
## [1] "    BIC: -96.353"
## [1] "    Mallow's Cp: 158.404"
## [1] "    mean squared error: 0.032"
## [1] ""
## [1] "Leverage point cutoff: 0.0932642487046632"
## [1] ""
## [1] "First 10 points of influence:"
## [1] "    case #53: 0.103"
## [1] "    case #68: 0.102"
## [1] "    case #87: 0.097"
## [1] "    case #90: 0.094"
## [1] "    case #103: 0.122"
## [1] ""
## 
## Call:
## lm(formula = formula(lm_mod7), data = model_lin_train[c(-53, 
##     -68, -87, -90, -103), ], weights = lm_mod8$w[c(-53, -68, 
##     -87, -90, -103)])
## 
## Weighted Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.39955 -0.13169 -0.00021  0.12864  0.41720 
## 
## Coefficients:
##                   Estimate   Std. Error t value             Pr(>|t|)    
## (Intercept)   14.563721647  0.037189860 391.605 < 0.0000000000000002 ***
## area           0.000051366  0.000005254   9.776 < 0.0000000000000002 ***
## mainroad       0.088784687  0.031173748   2.848             0.004648 ** 
## guestroom      0.065491056  0.027695872   2.365             0.018571 *  
## basement       0.085828585  0.024084842   3.564             0.000415 ***
## bed3           0.053680427  0.026267071   2.044             0.041709 *  
## bed4           0.073893471  0.036611812   2.018             0.044294 *  
## bath2          0.142063722  0.026492819   5.362      0.0000001462959 ***
## floor2         0.084921654  0.024516679   3.464             0.000596 ***
## floor3         0.203949148  0.048755612   4.183      0.0000360781643 ***
## floor4plus     0.298337827  0.044190686   6.751      0.0000000000581 ***
## car1           0.060818177  0.024306795   2.502             0.012784 *  
## car2           0.094709759  0.027112575   3.493             0.000536 ***
## semifurnished  0.141179923  0.023615878   5.978      0.0000000053860 ***
## furnished      0.143142302  0.026980739   5.305      0.0000001957569 ***
## ac             0.145829668  0.023516123   6.201      0.0000000015211 ***
## neighborhood   0.126779027  0.025075230   5.056      0.0000006799222 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1835 on 364 degrees of freedom
## Multiple R-squared:  0.7225, Adjusted R-squared:  0.7103 
## F-statistic: 59.24 on 16 and 364 DF,  p-value: < 0.00000000000000022
## [1] "--------------------------------------------------"
## lm(formula = formula(lm_mod7), data = model_lin_train[c(-53, 
##     -68, -87, -90, -103), ], weights = lm_mod8$w[c(-53, -68, 
##     -87, -90, -103)])

## [1] ""
## [1] "Shapiro test for normality: The p-value of 0.0016011455567499 is <= 0.05, so reject the null; i.e., the residuals are NOT NORMAL"
## [1] ""
## [1] "Breusch-Pagan test for homoschedasticity: The p-value of 0.699944941564394 and test statistic of 12.6251150042379 are inconclusive, so homoschedasticity can't be determined using this test. But since the p-value is > 0.05, it is reasonable to conclude that the residuals are HOMOSCHEDASTIC."
## [1] ""
## [1] "Variance inflation factor (VIF)"
## [1] "<=1: not correlated, 1-5: moderately correlated, >5: strongly correlated"
##          bed4          bed3    floor4plus        floor2     furnished 
##      2.033378      1.833734      1.627263      1.559638      1.501208 
## semifurnished      basement         bath2          area            ac 
##      1.458014      1.408550      1.329111      1.312546      1.277562 
##        floor3          car2     guestroom  neighborhood          car1 
##      1.270650      1.255128      1.248403      1.199533      1.189140 
##      mainroad 
##      1.170717 
## [1] ""
## [1] "Model scores:"
## [1] "    adjusted R-squared: 0.71"
## [1] "    AIC: -164.911"
## [1] "    BIC: -93.94"
## [1] "    Mallow's Cp: 157.372"
## [1] "    mean squared error: 0.032"
## [1] ""
## [1] "Leverage point cutoff: 0.094488188976378"
## [1] ""
## [1] "First 10 points of influence:"
## [1] "    case #87: 0.099"
## [1] ""

\(~\)

We perform five-fold cross-validation to validate our results.
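
The output below comes from a caret resampling run. To illustrate what five-fold cross-validation does, here is a hand-rolled base-R sketch on synthetic data; it is not the report's code, and the names `toy`, `fold` and `cv` are hypothetical. Each fold is held out once, the model is fitted on the other four, and the error metrics are averaged.

```r
set.seed(621)
n <- 200
toy <- data.frame(area = runif(n, 2000, 12000),
                  ac   = rbinom(n, 1, 0.3))
toy$log_price <- 14.5 + 0.00005 * toy$area + 0.15 * toy$ac + rnorm(n, sd = 0.2)

k <- 5
# Randomly assign each row to one of k folds of equal size
fold <- sample(rep(seq_len(k), length.out = n))

cv <- sapply(seq_len(k), function(i) {
  train <- toy[fold != i, ]
  test  <- toy[fold == i, ]
  fit   <- lm(log_price ~ area + ac, data = train)
  err   <- test$log_price - predict(fit, newdata = test)
  c(RMSE = sqrt(mean(err^2)), MAE = mean(abs(err)))
})

# Average the per-fold metrics (errors are on the log-price scale)
rowMeans(cv)
```

Because every row is predicted by a model that never saw it, these averages estimate out-of-sample error, which is why the cross-validated R-squared is usually somewhat lower than the in-sample value.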

## Linear Regression 
## 
## 386 samples
##  16 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 308, 309, 310, 309, 308 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   0.221452  0.6481765  0.169662
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
## 
## Call:
## lm(formula = .outcome ~ ., data = dat, weights = wts)
## 
## Weighted Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.39873 -0.13163  0.00627  0.12850  0.41563 
## 
## Coefficients:
##                   Estimate   Std. Error t value             Pr(>|t|)    
## (Intercept)   14.576837195  0.036640378 397.835 < 0.0000000000000002 ***
## area           0.000049910  0.000005072   9.840 < 0.0000000000000002 ***
## mainroad       0.084162100  0.030798460   2.733             0.006585 ** 
## guestroom      0.066825026  0.027117946   2.464             0.014186 *  
## basement       0.092910569  0.023534417   3.948     0.00009443484523 ***
## bed3           0.047228877  0.025581852   1.846             0.065666 .  
## bed4           0.071650779  0.035800323   2.001             0.046081 *  
## bath2          0.145767340  0.026146814   5.575     0.00000004789330 ***
## floor2         0.087224376  0.024366301   3.580             0.000390 ***
## floor3         0.228322844  0.045009601   5.073     0.00000062234166 ***
## floor4plus     0.304215596  0.042847379   7.100     0.00000000000645 ***
## car1           0.061013969  0.024105545   2.531             0.011785 *  
## car2           0.093803011  0.027068547   3.465             0.000592 ***
## semifurnished  0.136384089  0.023452238   5.815     0.00000001311493 ***
## furnished      0.139402409  0.026699824   5.221     0.00000029770305 ***
## ac             0.142091507  0.023231442   6.116     0.00000000244205 ***
## neighborhood   0.127632227  0.024725726   5.162     0.00000040033935 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1835 on 369 degrees of freedom
## Multiple R-squared:  0.7219, Adjusted R-squared:  0.7098 
## F-statistic: 59.86 on 16 and 369 DF,  p-value: < 0.00000000000000022

\(~\)

Compare predicted price to actual price for the training data.

\(~\)

We run the selected model against the validation set.
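
The Call below refits the selected formula on the evaluation data. A complementary check, sketched here on synthetic data with hypothetical names, is to score held-out rows with the model fitted on the training rows only, without refitting. This is offered as an illustration of the idea, not as the procedure used in this report.

```r
set.seed(621)
n <- 300
toy <- data.frame(area = runif(n, 2000, 12000),
                  ac   = rbinom(n, 1, 0.3))
toy$price <- exp(14.5 + 0.00005 * toy$area + 0.15 * toy$ac + rnorm(n, sd = 0.2))

# 70/30 split, mirroring the proportions used in this report
train_idx <- sample(n, size = round(0.7 * n))
train <- toy[train_idx, ]
valid <- toy[-train_idx, ]

fit <- lm(log(price) ~ area + ac, data = train)

# Predict on the log scale for the unseen rows, then back-transform with exp()
valid$pred_price <- exp(predict(fit, newdata = valid))

# Out-of-sample R-squared, computed on the log scale
ss_res <- sum((log(valid$price) - log(valid$pred_price))^2)
ss_tot <- sum((log(valid$price) - mean(log(valid$price)))^2)
r2 <- 1 - ss_res / ss_tot
r2
```

Keeping the coefficients fixed at their training values measures how well the fitted model generalizes, whereas refitting measures how well the same specification fits a new sample.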

## 
## Call:
## lm(formula = formula(lm_mod7), data = dfeval_clean, weights = lm_valid1$w)
## 
## Weighted Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.35051 -0.13513  0.00779  0.11523  0.39211 
## 
## Coefficients:
##                  Estimate  Std. Error t value             Pr(>|t|)    
## (Intercept)   14.59822860  0.05808238 251.337 < 0.0000000000000002 ***
## area           0.00004290  0.00000927   4.628           0.00000825 ***
## mainroad       0.15007215  0.04437113   3.382             0.000929 ***
## guestroom      0.06257943  0.04774515   1.311             0.192076    
## basement       0.09440512  0.04001255   2.359             0.019667 *  
## bed3           0.06515962  0.04355248   1.496             0.136842    
## bed4           0.02207513  0.05809029   0.380             0.704503    
## bath2          0.18985130  0.03815093   4.976           0.00000185 ***
## floor2         0.00991900  0.03985083   0.249             0.803795    
## floor3         0.15233007  0.06287172   2.423             0.016658 *  
## floor4plus     0.31853494  0.08226389   3.872             0.000164 ***
## car1           0.11779078  0.04379297   2.690             0.008008 ** 
## car2           0.13487369  0.04527726   2.979             0.003404 ** 
## semifurnished  0.11957014  0.03749098   3.189             0.001755 ** 
## furnished      0.05721555  0.04313205   1.327             0.186797    
## ac             0.18046066  0.03537770   5.101           0.00000106 ***
## neighborhood   0.08674482  0.04148445   2.091             0.038308 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1842 on 142 degrees of freedom
## Multiple R-squared:  0.7384, Adjusted R-squared:  0.7089 
## F-statistic: 25.05 on 16 and 142 DF,  p-value: < 0.00000000000000022
## [1] "--------------------------------------------------"
## lm(formula = formula(lm_mod7), data = dfeval_clean, weights = lm_valid1$w)

## [1] ""
## [1] "Shapiro test for normality: The p-value of 0.121694773959435 is > 0.05, so do not reject the null; i.e., the residuals are NORMAL"
## [1] ""
## [1] "Breusch-Pagan test for homoschedasticity: The p-value of 0.772092876772661 and test statistic of 11.5838980266342 are inconclusive, so homoschedasticity can't be determined using this test. But since the p-value is > 0.05, it is reasonable to conclude that the residuals are HOMOSCHEDASTIC."
## [1] ""
## [1] "Variance inflation factor (VIF)"
## [1] "<=1: not correlated, 1-5: moderately correlated, >5: strongly correlated"
##          bed4          bed3        floor2        floor3          area 
##      2.046080      2.028076      1.739436      1.625511      1.590490 
##      basement    floor4plus semifurnished     furnished  neighborhood 
##      1.546923      1.509985      1.504967      1.482755      1.436899 
##          car2         bath2     guestroom          car1      mainroad 
##      1.317953      1.271320      1.269881      1.259422      1.234909 
##            ac 
##      1.196139 
## [1] ""
## [1] "Model scores:"
## [1] "    adjusted R-squared: 0.709"
## [1] "    AIC: -56.184"
## [1] "    BIC: -0.943"
## [1] "    Mallow's Cp: 70.416"
## [1] "    mean squared error: 0.03"
## [1] ""
## [1] "Leverage point cutoff: 0.226415094339623"
## [1] ""
## [1] "First 10 points of influence:"
## [1] "    case #3: 0.243"
## [1] ""

\(~\)

Compare predicted price to actual price for the evaluation data.

# Compare predicted price to actual (eval data)
# Predict on the log scale for the evaluation rows, then back-transform with exp()
dfeval_clean$pred_price <- exp(predict(lm_valid2, newdata = dfeval_clean))
dfeval_clean %>% ggplot(mapping = aes(x = price, y = pred_price)) +
    geom_point() +
    geom_smooth(method = 'lm', se = TRUE) +
    xlab('Price') + ylab('Predicted Price') +
    ggtitle('Figure 16. Predicted Price vs Price (Validation Data)') +
    theme_ipsum()

\(~\)

The table below compares our models:
Table 3. Summary of results
#   Train/Validation  Linear/Robust  Full/Step-reduced  Log Transform  Outliers Removed  Huber-Weighted  Adj. R-squared
1   Train             Linear         Full               -              -                 -               0.675
2   Train             Linear         Step               -              -                 -               0.677
3   Train             Linear         Step               -              -                 -               0.670
4   Train             Linear         Full               Yes            -                 -               0.691
5   Train             Linear         Step               Yes            -                 -               0.691
6   Train             Linear         Step               Yes            Yes               -               0.684
7   Train             Linear         Step               Yes            Yes               -               0.684
8   Train             Robust         Step               Yes            -                 -               NA
9   Train             Linear         Step               Yes            -                 Yes             0.710
10  Train             Linear         Step               Yes            Yes               Yes             0.710
11  Validation        Linear         Step               Yes            -                 Yes             0.709